
large-scale problem on a huge data set [53]. We propose to solve the problem within the backpropagation framework based on two considerations: 1) the inference process of the optimized model operates on the quantized variables, which means that the variables must be quantized in the forward pass (corresponding to inference) during training and the loss must be calculated from the quantized variables; 2) the backpropagation does not need to be quantized, but it must fully account for the relationship between the quantized variables and their full-precision counterparts. Based on these considerations, we propose that in the $k$th iteration, using the projection in Eq. 3.29, $x^{[k]}$ is quantized to $\hat{x}^{[k]}$ in the forward pass as

$$\hat{x}^{[k]} = P_\Omega\bigl(\omega, x^{[k]}\bigr), \tag{3.30}$$
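As an illustration of this forward quantization step, the following minimal NumPy sketch assumes that $\Omega$ is a small set of discrete levels (here $\{-1, +1\}$) and that $P_\Omega(\omega, x)$ scales $x$ by $\omega$ elementwise and snaps each entry to the nearest element of $\Omega$; the actual projection is the one defined in Eq. 3.29, and all names below are illustrative.

```python
import numpy as np

# Hypothetical discrete set Omega, e.g. binary levels {-1, +1}.
OMEGA = np.array([-1.0, 1.0])

def project_onto_omega(omega, x, levels=OMEGA):
    """Illustrative P_Omega(omega, x): scale x by omega (elementwise here) and
    snap every entry to the nearest level in Omega.  The actual projection of
    Eq. 3.29 may combine omega and x differently."""
    z = omega * x                                    # scaled latent variable
    idx = np.abs(z[..., None] - levels).argmin(-1)   # index of nearest level
    return levels[idx]

# Forward pass of iteration k (Eq. 3.30): quantize the latent kernel x^[k].
rng = np.random.default_rng(0)
x_k = rng.standard_normal((3, 3))     # latent (full-precision) kernel
omega = np.full_like(x_k, 0.75)       # projection matrix, same shape as x here
x_hat_k = project_onto_omega(omega, x_k)
print(x_hat_k)                        # every entry is an element of OMEGA
```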

The quantized variable $\hat{x}^{[k]}$ is then used to improve the backpropagation process by defining an objective as

$$\min \; f(\omega, x) \quad \text{s.t.} \quad \hat{x}^{[k]}_{j} = P^{j}_{\Omega}(\omega_j, x), \tag{3.31}$$

where $\omega_j$, $j \in \{1, \ldots, J\}$, is the $j$th projection matrix$^3$, and $J$ is the total number of projection matrices. To solve the problem in (3.31), we define our update rule as

$$x = x^{[k]} - \eta\,\delta^{[k]}_{\hat{x}}, \tag{3.32}$$

where the superscript $[k+1]$ is omitted from $x$, $\delta^{[k]}_{\hat{x}}$ is the gradient of $f(\omega, x)$ with respect to $x$ at $x = \hat{x}$, and $\eta$ is the learning rate. The quantization process $\hat{x}^{[k]} \leftarrow x^{[k]}$, that is, $P^{j}_{\Omega}(\omega_j, x^{[k]})$, is equivalent to finding the projection of $\omega_j\bigl(x + \eta\delta^{[k]}_{\hat{x}}\bigr)$ onto $\Omega$ as

$$\hat{x}^{[k]} = \arg\min_{\hat{x}} \Bigl\{ \bigl\| \hat{x} - \omega_j\bigl(x + \eta\delta^{[k]}_{\hat{x}}\bigr) \bigr\|^2, \; \hat{x} \in \Omega \Bigr\}. \tag{3.33}$$

Obviously, $\hat{x}^{[k]}$ is the solution to the problem in (3.33). Thus, by incorporating (3.33) into $f(\omega, x)$, we obtain a new formulation of (3.31) based on the Lagrangian method as

$$\min \; f(\omega, x) + \frac{\lambda}{2} \sum_{j}^{J} \bigl\| \hat{x}^{[k]} - \omega_j\bigl(x + \eta\delta^{[k]}_{\hat{x}}\bigr) \bigr\|^2. \tag{3.34}$$

The newly added term (the second term in (3.34)) is a quadratic function and is referred to as the projection loss.
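To make the projection loss concrete, the following sketch evaluates the penalty of Eq. 3.34 for a single projection matrix ($J = 1$) together with the update of Eq. 3.32. It is an illustration only: the sign-based stand-in for $P_\Omega$, the values of $\eta$ and $\lambda$, and the placeholder gradient `delta_xhat` are assumptions, not part of the method's definition.

```python
import numpy as np

def projection_loss(x_hat_k, omega, x, delta_xhat, eta, lam):
    """Quadratic penalty of Eq. 3.34 for a single projection matrix (J = 1);
    for J > 1 this term is summed over the J projection matrices."""
    residual = x_hat_k - omega * (x + eta * delta_xhat)
    return 0.5 * lam * np.sum(residual ** 2)

# One illustrative iteration k.
rng = np.random.default_rng(0)
eta, lam = 0.01, 1e-4
x = rng.standard_normal((3, 3))       # latent kernel x^[k]
omega = np.full_like(x, 0.75)         # projection matrix (elementwise here)

# Stand-in for Eq. 3.30 with Omega = {-1, +1}: project omega * x via sign().
x_hat = np.sign(omega * x)

# Placeholder for the gradient of f(omega, x) at x = x_hat (delta in Eq. 3.32).
delta_xhat = rng.standard_normal(x.shape)

penalty = projection_loss(x_hat, omega, x, delta_xhat, eta, lam)
x_next = x - eta * delta_xhat         # Eq. 3.32: update of the latent kernel
# In the full objective (Eq. 3.34) the penalty also contributes a gradient to
# x; only the plain Eq. 3.32 step is shown here.
print(float(penalty), x_next.shape)
```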

3.5.3 Theoretical Analysis

Taking a closer look at the projection loss in Eq. 3.34, we have

$$\hat{x}^{[k]} - \omega\bigl(x + \eta\delta^{[k]}_{\hat{x}}\bigr) = \hat{x}^{[k]} - \omega x - \omega\eta\delta^{[k]}_{\hat{x}}. \tag{3.35}$$

In this case, we consider only one projection function, so the subscript $j$ of $\omega_j$ is omitted for simplicity; the analysis for multiple projections is given afterward. In the forward step, only the discrete kernel values participate in the calculation, so their gradients can be obtained by

$$\frac{\partial f(\omega, \hat{x}^{[k]})}{\partial \hat{x}^{[k]}} = \omega\,\delta^{[k]}_{\hat{x}}, \tag{3.36}$$

$^3$Since the kernel parameters $x$ are represented as a matrix, $\omega_j$, like $\omega$, also denotes a matrix.